Regression with scikit-learn
using the Soccer Dataset
We will again use the open dataset from Kaggle that we used for our example in Week 1.
Recall that this European Soccer Database has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016.
Note: Please download the file database.sqlite if you don't yet have it in your Week-7-MachineLearning folder.
Import Libraries
import sqlite3
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
Read Data from the Database into pandas
# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
df.head()
df.shape
df.columns
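The loading step above can be tried without the real file. The following is a minimal sketch that assumes a tiny in-memory SQLite database as a stand-in for database.sqlite; the three rows and the reduced column set are made up for illustration and are not taken from the soccer dataset.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for database.sqlite.
demo_cnx = sqlite3.connect(":memory:")
demo_cnx.execute(
    "CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL, potential REAL)"
)
demo_cnx.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?, ?)",
    [(1, 67.0, 71.0), (2, 74.0, 76.0), (3, None, 80.0)],  # made-up rows
)
demo_cnx.commit()

# Same call as in the notebook: the whole table becomes a DataFrame.
demo_df = pd.read_sql_query("SELECT * FROM Player_Attributes", demo_cnx)

# Close the connection once the data is in memory.
demo_cnx.close()

print(demo_df.shape)          # (rows, columns)
print(list(demo_df.columns))  # column names come from the table definition
```

SQL NULL values arrive in pandas as missing values, which is why the notebook later cleans the data before fitting a model.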
Declare the Columns You Want to Use as Features
features = [
    'potential', 'crossing', 'finishing', 'heading_accuracy',
    'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
    'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
    'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
    'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
    'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
    'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
    'gk_reflexes']
Specify the Prediction Target
target = ['overall_rating']
Clean the Data
df = df.dropna()
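Because dropna() removes every row that has at least one missing value, it can be worth checking how much data is lost. The following is a small sketch on a made-up DataFrame (the name toy and its values are assumptions for illustration, not rows from the soccer dataset).

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two different rows.
toy = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 74.0, 61.0],
    "potential": [71.0, 80.0, np.nan, 65.0],
})

# Count missing values per column before cleaning.
print(toy.isna().sum())

rows_before = len(toy)
cleaned = toy.dropna()  # drops any row containing at least one NaN
rows_dropped = rows_before - len(cleaned)

print(f"dropped {rows_dropped} of {rows_before} rows")
```

The same two checks (isna().sum() and a row count before and after) could be run on df in the notebook.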
Extract Features and Target ('overall_rating') Values into Separate Dataframes
X = df[features]
y = df[target]
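Since target is a list, df[target] returns a one-column DataFrame rather than a Series. The sketch below, on a made-up DataFrame, shows the difference in type and shape; the variable names y_frame and y_series are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "overall_rating": [67.0, 74.0, 61.0],
    "potential": [71.0, 76.0, 65.0],
})

y_frame = toy[["overall_rating"]]  # list of column names -> DataFrame, 2-D
y_series = toy["overall_rating"]   # single column name   -> Series, 1-D

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

Both forms hold the same values; the notebook uses the DataFrame form.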
Let us look at a typical row from our features:
X.iloc[2]
Let us also display our target values:
y
Split the Dataset into Training and Test Datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
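To see what test_size and random_state do, here is a minimal sketch on 100 made-up samples (X_toy and y_toy are placeholder arrays, not the soccer data). It checks the sizes of the two parts and that repeating the call with the same random_state gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324
)

# Same arguments, same random_state: the split should be identical.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324
)

print(len(X_tr), len(X_te))          # training rows, test rows
print(np.array_equal(X_te, X_te2))   # whether the two test sets match
```

Roughly one third of the rows are held out for testing, and fixing random_state makes the result reproducible between runs.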
(1) Linear Regression: Fit a model to the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Perform Prediction using Linear Regression Model
y_prediction = regressor.predict(X_test)
y_prediction
What is the mean of the expected target value in the test set?
y_test.describe()
Evaluate Linear Regression Accuracy using Root Mean Square Error
RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))
print(RMSE)
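The RMSE is the square root of the average squared difference between true and predicted values, so it is expressed in the same units as the target (rating points here). As a sanity check, the sketch below computes it by hand on four made-up ratings and compares the result with the sqrt(mean_squared_error(...)) form used in the notebook.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ratings for illustration.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([72.0, 64.0, 77.0, 75.0])

residuals = y_true - y_pred                   # -2, 1, 3, 0
rmse_manual = sqrt(np.mean(residuals ** 2))   # sqrt((4 + 1 + 9 + 0) / 4)

rmse_sklearn = sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))

print(rmse_manual, rmse_sklearn)
```

Comparing the RMSE with the mean and standard deviation from y_test.describe() gives a sense of how large the error is relative to the spread of the ratings.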
(2) Decision Tree Regressor: Fit a new regression model to the training set
regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train, y_train)
Perform Prediction using Decision Tree Regressor
y_prediction = regressor.predict(X_test)
y_prediction
For comparison: What is the mean of the expected target value in the test set?
y_test.describe()
Evaluate Decision Tree Regression Accuracy using Root Mean Square Error
RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))
print(RMSE)
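The two models can also be evaluated side by side in one loop. The following is a sketch on synthetic data from make_regression, so the numbers it prints say nothing about the soccer dataset; the names models and results are assumptions for illustration.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the player attributes.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=324
)

# Same two model types as in the notebook.
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=20, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    predictions = model.predict(X_te)
    results[name] = sqrt(mean_squared_error(y_true=y_te, y_pred=predictions))

for name, rmse in results.items():
    print(f"{name}: RMSE = {rmse:.2f}")
```

Keeping each fitted model under its own name, rather than reusing regressor, makes it easy to compare them after both have been trained.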